feat(schema): document #355

luoling8192 · 2025-09-16T17:51:35Z

No description provided.

gemini-code-assist

Summary of Changes

Hello @luoling8192, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request implements a new feature for document processing and storage. It introduces a dedicated "documents" table in the database to house scraped web content, along with its processed versions, summaries, and vector embeddings for efficient retrieval. The existing link resolver has been enhanced to automatically detect URLs in chat messages, scrape their content, and persist this information, enabling richer data analysis and search capabilities within the application.

Highlights

Document Storage: Introduced a new "documents" table in the database to store scraped web content, summaries, and vector embeddings.
Web Scraping Integration: The "link-resolver" now automatically scrapes content from URLs found in chat messages using r.jina.ai and stores it in the new "documents" table.
New Dependencies: Added @mendable/firecrawl-js to facilitate web scraping, along with its associated transitive dependencies.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature	Command	Description
Code Review	`/gemini review`	Performs a code review for the current pull request in its current state.
Pull Request Summary	`/gemini summary`	Provides a summary of the current pull request in its current state.
Comment	@gemini-code-assist	Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help	`/gemini help`	Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

netlify · 2025-09-16T17:52:20Z

✅ Deploy Preview for tgsearch ready!

Name	Link
🔨 Latest commit	`19e598c`
🔍 Latest deploy log	https://app.netlify.com/projects/tgsearch/deploys/68c9a3a93114af0008bbfe53
😎 Deploy Preview	https://deploy-preview-355--tgsearch.netlify.app
📱 Preview on mobile	Toggle QR Code... Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify project configuration.

gemini-code-assist

Code Review

This pull request introduces a new documents table and related functionality to scrape and store content from links found in messages. It adds a link resolver using Jina's reader service and a new dependency firecrawl-js. While this is a good feature addition, the review has identified several critical issues in the implementation of document retrieval, potential data integrity problems with foreign keys, and areas for improvement in the link resolver and schema definitions. Addressing these points will improve the correctness, robustness, and maintainability of the new feature.

gemini-code-assist · 2025-09-16T17:54:32Z

packages/core/src/models/documents.ts

+    const relevantDocs = (await withDb(async db => db
+      .select()
+      .from(documentsTable)
+      .where(sql`${documentsTable.processed_content_jieba_tokens} @> ${content.text}`)


The text search query sql${documentsTable.processed_content_jieba_tokens} @> ${content.text}`` is incorrect. The content.text is a raw string, but a `jsonb` containment query (`@>`) on `processed_content_jieba_tokens` expects a JSONB array of tokens. The input text should be tokenized first (e.g., using Jieba), and then the tokens should be passed to the SQL query as a JSON array string, for example: `sql`... @> ${JSON.stringify(tokens)}::jsonb`.

gemini-code-assist · 2025-09-16T17:54:33Z

packages/core/src/models/documents.ts

+    const relevantDocs = (await withDb(async db => db
+      .select()
+      .from(documentsTable)
+      .where(sql`${documentsTable.summary_vector_1536} <=> ${content.embedding}`)


The vector similarity operator <=> is being used in a WHERE clause. This operator calculates the distance and is typically used in an ORDER BY clause to sort results by similarity. Using it in a WHERE clause without a comparison is incorrect. You should use orderBy to sort by distance.

Suggested change

.where(sql`${documentsTable.summary_vector_1536} <=> ${content.embedding}`)

.orderBy(sql`${documentsTable.summary_vector_1536} <=> ${content.embedding}`)

gemini-code-assist · 2025-09-16T17:54:33Z

packages/core/src/models/documents.ts

+  const documents = []
+
+  if (content.text) {
+    const relevantDocs = (await withDb(async db => db
+      .select()
+      .from(documentsTable)
+      .where(sql`${documentsTable.processed_content_jieba_tokens} @> ${content.text}`)
+      .limit(pagination?.limit || 10)
+      .offset(pagination?.offset || 0),
+    )).unwrap()
+    logger.withFields({ relevantDocs: relevantDocs.length }).verbose('Retrieved jieba documents')
+    documents.push(...relevantDocs)
+  }
+
+  if (content.embedding && content.embedding.length > 0) {
+    const relevantDocs = (await withDb(async db => db
+      .select()
+      .from(documentsTable)
+      .where(sql`${documentsTable.summary_vector_1536} <=> ${content.embedding}`)
+      .limit(pagination?.limit || 10)
+      .offset(pagination?.offset || 0),
+    )).unwrap()
+    logger.withFields({ relevantDocs: relevantDocs.length }).verbose('Retrieved vector documents')
+    documents.push(...relevantDocs)
+  }


The retrieveDocuments function performs two separate database queries for text and vector search, and then concatenates the results. This approach has several issues:

Duplicates: It can return duplicate documents if a document matches both text and vector queries.

Incorrect Pagination: The limit and offset are applied to each query separately, not to the combined result set. This means you could get up to 2 * limit results, and pagination will not work as expected.

Inconsistent Ranking: The results from the two queries are not ranked against each other; they are simply appended. This can lead to less relevant results appearing before more relevant ones.

A better approach would be to combine these into a single query if possible, or fetch results and then merge, de-duplicate, and re-rank them in the application before applying pagination.

gemini-code-assist · 2025-09-16T17:54:33Z

packages/core/src/schemas/documents.ts

+
+export const documentsTable = pgTable('documents', {
+  id: uuid().primaryKey().defaultRandom(),
+  chat_messages_id: uuid().notNull().references(() => chatMessagesTable.id),


The foreign key on chat_messages_id defaults to ON DELETE NO ACTION. This will cause an error when trying to delete a chat message that has associated documents, which can lead to data integrity issues or unexpected application behavior. If documents should be deleted along with their corresponding message, consider using onDelete: 'cascade'.

Suggested change

chat_messages_id: uuid().notNull().references(() => chatMessagesTable.id),

chat_messages_id: uuid().notNull().references(() => chatMessagesTable.id, { onDelete: 'cascade' }),

gemini-code-assist · 2025-09-16T17:54:33Z

packages/core/src/message-resolvers/link-resolver.ts

-      return Ok([] as CoreMessage[])
+      for (const message of opts.messages) {
+        if (message.content) {
+          const links = message.content.match(/https?:\/\/\S+/g)


The regex https?:\/\/\S+ for extracting links is a bit too greedy. It can incorrectly include trailing punctuation like commas, periods, or closing parentheses as part of the URL, which can lead to failed scraping attempts. Consider cleaning the matched URLs to remove such trailing characters.

gemini-code-assist · 2025-09-16T17:54:33Z

packages/core/src/message-resolvers/link-resolver.ts

+          if (links && links.length > 0) {
+            for (const link of links) {
+              try {
+                const response = await fetch(`https://r.jina.ai/${link}`, {


The URL for the Jina reader service is hardcoded. It's better to extract this into a constant or a configuration variable to make it easier to change or update in the future.

gemini-code-assist · 2025-09-16T17:54:33Z

packages/core/src/message-resolvers/link-resolver.ts

+                await recordDocuments([{
+                  chatMessagesId: message.uuid,
+                  rawContent: doc,
+                  processedContent: doc,
+                  processedContentJiebaTokens: [],
+                  summary: doc,
+                }])


The scraped document is used for rawContent, processedContent, and summary without any processing. Also, processedContentJiebaTokens is an empty array. For these fields to be effective for search and retrieval, the content should be processed (e.g., cleaned, tokenized) and a summary should be generated. This implementation seems like a placeholder and should be enhanced for the feature to be fully functional.

gemini-code-assist · 2025-09-16T17:54:33Z

packages/core/src/schemas/documents.ts

+}, table => [
+  index('documents_summary_vector_1536_index').using('hnsw', table.summary_vector_1536.op('vector_cosine_ops')),
+  index('documents_summary_vector_1024_index').using('hnsw', table.summary_vector_1024.op('vector_cosine_ops')),
+  index('documents_summary_vector_768_index').using('hnsw', table.summary_vector_768.op('vector_cosine_ops')),
+  index('documents_processed_content_jieba_tokens_index').using('gin', table.processed_content_jieba_tokens.op('jsonb_path_ops')),
+])


The recordDocuments function in packages/core/src/models/documents.ts uses onConflictDoUpdate with chat_messages_id as the target. This implies that chat_messages_id should be unique. To enforce this at the database level and for clarity, you should add a unique index on this column.

Suggested change

}, table => [

index('documents_summary_vector_1536_index').using('hnsw', table.summary_vector_1536.op('vector_cosine_ops')),

index('documents_summary_vector_1024_index').using('hnsw', table.summary_vector_1024.op('vector_cosine_ops')),

index('documents_summary_vector_768_index').using('hnsw', table.summary_vector_768.op('vector_cosine_ops')),

index('documents_processed_content_jieba_tokens_index').using('gin', table.processed_content_jieba_tokens.op('jsonb_path_ops')),

])

}, table => [

uniqueIndex('documents_chat_messages_id_unique_index').on(table.chat_messages_id),

index('documents_summary_vector_1536_index').using('hnsw', table.summary_vector_1536.op('vector_cosine_ops')),

index('documents_summary_vector_1024_index').using('hnsw', table.summary_vector_1024.op('vector_cosine_ops')),

index('documents_summary_vector_768_index').using('hnsw', table.summary_vector_768.op('vector_cosine_ops')),

index('documents_processed_content_jieba_tokens_index').using('gin', table.processed_content_jieba_tokens.op('jsonb_path_ops')),

])

luoling8192 added 2 commits September 17, 2025 01:49

feat(schema): document

a9b630a

feat: jina link resolver

19e598c

gemini-code-assist bot reviewed Sep 16, 2025

View reviewed changes

luoling8192 marked this pull request as draft September 16, 2025 17:52

gemini-code-assist bot reviewed Sep 16, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

feat(schema): document #355

feat(schema): document #355

luoling8192 commented Sep 16, 2025

Uh oh!

gemini-code-assist bot left a comment

Uh oh!

netlify bot commented Sep 16, 2025 •

edited

Loading

Uh oh!

gemini-code-assist bot left a comment

Uh oh!

gemini-code-assist bot Sep 16, 2025

Uh oh!

gemini-code-assist bot Sep 16, 2025

Uh oh!

gemini-code-assist bot Sep 16, 2025

Uh oh!

gemini-code-assist bot Sep 16, 2025

Uh oh!

gemini-code-assist bot Sep 16, 2025

Uh oh!

gemini-code-assist bot Sep 16, 2025

Uh oh!

gemini-code-assist bot Sep 16, 2025

Uh oh!

gemini-code-assist bot Sep 16, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

	.where(sql`${documentsTable.summary_vector_1536} <=> ${content.embedding}`)
	.orderBy(sql`${documentsTable.summary_vector_1536} <=> ${content.embedding}`)

	chat_messages_id: uuid().notNull().references(() => chatMessagesTable.id),
	chat_messages_id: uuid().notNull().references(() => chatMessagesTable.id, { onDelete: 'cascade' }),

Uh oh!

feat(schema): document #355

Are you sure you want to change the base?

feat(schema): document #355

Conversation

luoling8192 commented Sep 16, 2025

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Summary of Changes

Highlights

Footnotes

Uh oh!

netlify bot commented Sep 16, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

✅ Deploy Preview for tgsearch ready!

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist bot Sep 16, 2025

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Sep 16, 2025

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Sep 16, 2025

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Sep 16, 2025

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Sep 16, 2025

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Sep 16, 2025

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Sep 16, 2025

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Sep 16, 2025

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

netlify bot commented Sep 16, 2025 •

edited

Loading